Contents
Introduction
In the previous article we discussed and learned how to reshape data with base R to a form that is practical for our use. In this one, we will look at basic data manipulation techniques, namely obtaining relevant subsets of our data. The key will be safety and avoiding complication and confusion as much as possible. This is why we:
- try to avoid using
subset
, as this function is implemented via non-standard evaluation. - also skip
$
as it uses partial matching and is hardly used with variables as column names. - not mention more details related to
list
properties ofdata.frames
here as the topic could get confusing. If you would like to go to more important detail, we strongly recommend a thorough read of the subsetting chapter of Hadley Wickham’s Advanced R
How to use this article
- This article is best used with an R session opened in a window next to it - you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed - it is a
case4base
after all! - If you have no time for reading, you can click here to get just the code with commentary
First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):
gdi <- read.csv(
stringsAsFactors = FALSE
, url("https://jozef.io/post/data/ESA2010_GDI.csv")
)
head(gdi[, 1:6, drop = FALSE])
## country Y.1995 Y.1996 Y.1997 Y.1998 Y.1999
## 1 EU 28 NA NA NA NA 5982392.8
## 2 Euro area 19 NA NA NA NA 4393727.3
## 3 Belgium 140734.1 141599.4 145023.2 149705.2 153804.0
## 4 Bulgaria 1036.0 1468.1 12367.4 14921.1 16052.8
## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0
## 6 Denmark 566363.0 578102.0 591416.0 621236.0 614893.0
Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.
The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.
Selecting (subsetting) relevant data from a data.frame
In this paragraph, we will try to show how to subset with as little hassle as possible while preserving the maximum safety in your operations. We shall go into more detail later in the article. The standard approach to subsetting data.frames
can be summarised:
dataframe_name[row_subset, col_subset, drop = FALSE]
Where:
dataframe_name
is the name of thedata.frame
we are subsettingrow_subset
is a vector specifying the subset of rowscol_subset
is a vector specifying the subset of columnsdrop = FALSE
is to make sure the result does not get simplified when not indented. This should always be used, unless we specifically want to simplify the result (e.g. to a vector for indexing)
Constructing meaningful subsets simply and safely
In practice, we of course will most of the time not select rows and/or columns with positions known apriori, but based on more variable conditions. For this purpose, the advised way would be is to construct logical vectors:
Let us now subset the rows of our data to get the data for countries that have known (not NA
) value in the year 2016 and this value is less than 1 million:
rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000
gdi[rowidx, c(1, 23), drop = FALSE]
## country Y.2016
## 3 Belgium 243825.50
## 4 Bulgaria 60237.00
## 8 Estonia 12548.30
## 9 Ireland 97318.90
## 11 Spain 698701.00
## 13 Croatia 0.00
## 16 Latvia 15737.79
## 17 Lithuania 24743.49
## 18 Luxembourg 20155.80
## 21 Netherlands 357383.00
## 22 Austria 214980.60
## 24 Portugal 128789.39
## 26 Slovenia 24756.63
## 27 Slovakia 48882.91
## 28 Finland 126590.00
## 33 Switzerland 458641.00
Note that when creating the
rowidx
we omitted thedrop = FALSE
despite the aforementioned best practice. This is because in this particular case we consciously welcome the result being simplified to a vector, as its use is only as an index for subsetting.
More ways to provide subset indices
Subsetting can be done in a few ways. We will now use them to show a subset the first two and the 27th row and the first, 22nd and 23rd column, giving us the GDI for EU28, Euro Area 19 and Slovakia in the years 2015 and 2016:
- Logical vectors
TRUE
for rows/columns to subset,FALSE
for those to omit
st1 <- gdi[c(TRUE, TRUE, rep(FALSE, 24), TRUE, rep(FALSE, 8))
, c(TRUE, rep(FALSE, 20), rep(TRUE, 2))
, drop = FALSE
]
- Numeric vectors of row/column numbers to subset
st2 <- gdi[c(1:2, 27)
, c(1, 22:23)
, drop = FALSE
]
- Negative numeric vectors of row/column numbers to omit
st3 <- gdi[c(-3:-26, -28:-35)
, c(-2:-21)
, drop = FALSE
]
- Character vectors of row/column names to subset
st4 <- gdi[c("1", "2", "27") # we do not have very meaningful rownames
, c("country", "Y.2015", "Y.2016")
, drop = FALSE
]
st4
## country Y.2015 Y.2016
## 1 EU 28 9439578.39 9454683.60
## 2 Euro area 19 6598231.27 6736686.43
## 27 Slovakia 47464.71 48882.91
- All of the above give identical results
identical(st1, st2) && identical(st2, st3) && identical(st3, st4)
## [1] TRUE
Tips
- The above methods are also working and safe for matrices
- Negative and positive numeric vectors cannot be combined
Alternatives to base R
- dplyr::select and dplyr::filter
- Using data.table
TL;DR - Just want the code
No time for reading? Click here to get just the code with commentary
Exercises
- What is the difference between
gdi[3, 3]
andgdi[3, 3, drop = FALSE]
? - What is the difference between
gdi[-3, 3]
andgdi[3, -3]
? What aboutgdi[-3, 3, drop = FALSE]
? - Why cannot we omit the first part of the & in
rowidx <- !is.na(gdi[, "Y.2016"]) & gdi[, "Y.2016"] < 1000000
. What would happen if we just didrowidx <- gdi[, "Y.2016"] < 1000000
? - Bonus question 1: Why is
identical(gdi[, "Y.2016", drop = FALSE], gdi["Y.2016"])
- Bonus question 2: Why is
identical(gdi[, "Y.2016"], gdi[["Y.2016"]])
References
- Advanced R’s chapter on subsetting
- and on data types
- original eurostat data source